<h1>Test Guidelines</h1>
<p>The COCO data can be obtained from the <a href="#download">download page</a>. Each challenge has a different training / validation / testing set; details are provided on the download page and summarized here:</p>
<div class="json">
  <div class="jsonktxt">2014 Train/Val</div><div class="jsonvtxt"><a href="#detection-2015">Detection 2015</a>, <a href="#captions-2015">Captioning 2015</a>, <a href="#detection-2016">Detection 2016</a>, <a href="#keypoints-2016">Keypoints 2016</a></div>
  <div class="jsonktxt">2014 Testing  </div><div class="jsonvtxt"><a href="#captions-2015">Captioning 2015</a></div>
  <div class="jsonktxt">2015 Testing  </div><div class="jsonvtxt"><a href="#detection-2015">Detection 2015</a>, <a href="#detection-2016">Detection 2016</a>, <a href="#keypoints-2016">Keypoints 2016</a></div>
  <div class="jsonktxt">2017 Train/Val</div><div class="jsonvtxt"><a href="#detection-2017">Detection 2017</a>, <a href="#keypoints-2017">Keypoints 2017</a>, <a href="#stuff-2017">Stuff 2017</a></div>
  <div class="jsonktxt">2017 Testing  </div><div class="jsonvtxt"><a href="#detection-2017">Detection 2017</a>, <a href="#keypoints-2017">Keypoints 2017</a>, <a href="#stuff-2017">Stuff 2017</a></div>
  <div class="jsonktxt">2017 Unlabeled</div><div class="jsonvtxt fontBlue">[optional data for any competition]</div>
</div>
<p>The recommended training data for any COCO challenge is the corresponding COCO training set. Validation data may also be used for training when submitting results on the test set (although starting in 2017 the validation set contains only 5K images, so the benefit of doing so is minimal). Note that the 2017 train/val data contains the same images as the 2014 train/val data, merely organized differently, so there is no benefit to using the 2014 training data for the 2017 competitions.</p>
<p><i>External data</i> of any form is allowed. However, any form of annotation or use of the COCO test sets for supervised or unsupervised training is strictly forbidden. <b>Note: please explicitly specify any and all external data used for training in the "method description" when uploading results to the evaluation server.</b></p>

<h1>Test Set Splits</h1>
<p>Prior to 2017, the test set had four splits (dev / standard / reserve / challenge). Starting in 2017, we simplified the test set to only the dev / challenge splits, with the other two splits removed. The original purpose of the four splits was to protect the integrity of the challenge while giving researchers flexibility to test their systems. After multiple years of running the challenges, we saw no evidence of overfitting to specific splits (the output space complexity and the test set size protect against simple attacks such as <a href="http://blog.mrtz.org/2015/03/09/competition.html" target="_blank">wacky boosting</a>). Therefore, we simplified participation in the challenges accordingly in 2017.</p>

<h2>2017 Test Set Splits</h2>
<p>The 2017 COCO test set consists of ~40K test images. The test set is divided into two roughly equally sized splits of ~20K images each: <i>test-dev</i> and <i>test-challenge</i>. Each is described in detail below. Additionally, when uploading to the evaluation servers, we now allow submission of results on the 5K val split for debugging the upload process. Note that the test set guidelines changed in 2017; see the 2015 guidelines below for old usage information. The 2017 test splits are as follows:</p>
<div class="json fontMono">
  <table class="datasetSplits">
    <tr><th>split</th><th>#imgs</th><th>submit limit</th><th>scores available</th><th>leaderboard</th></tr>
    <tr><td>Val</td><td>~5K</td><td>no limit</td><td>immediate</td><td>none</td></tr>
    <tr><td>Test-Dev</td><td>~20K</td><td>5 per day</td><td>immediate</td><td>year-round</td></tr>
    <tr><td>Test-Challenge</td><td>~20K</td><td>5 total</td><td>workshop</td><td>workshop</td></tr>
  </table>
</div>
<p><b>Test-Dev</b>: The test-dev split is the default test data for testing under general circumstances. Results in papers should generally be reported on test-dev to allow for fair public comparison. The number of submissions per participant is limited to 5 uploads per day to avoid overfitting. Note that only a single submission per participant can be published to the public leaderboard (a paper, however, may report multiple test-dev results). The test-dev server will remain open year-round.</p>
<p><b>Test-Challenge</b>: The test-challenge split is used for the COCO challenges hosted on a yearly basis. Results are revealed during the relevant workshop (typically at ECCV or ICCV). The number of submissions per participant is limited to a maximum of 5 uploads total over the length of the challenge. If you submit multiple entries, the best result based on <i>test-dev</i> AP is selected as your entry for the competition. Note that only a single submission per participant can be published to the public leaderboard. The test-challenge server will remain open for a fixed period prior to each year's competition.</p>
<p>The images belonging to each split are defined in image_info_test-dev2017 (for test-dev) and image_info_test2017 (for combined test-dev and test-challenge). Info for test-challenge images is not explicitly provided. Instead, results must be submitted on the full test set (both test-dev and test-challenge) when participating in the challenge. This serves two goals. First, participants get automatic feedback on their submission by seeing evaluation results on test-dev prior to the challenge workshop. Second, after the challenge workshop, it gives future participants an opportunity to compare against challenge entries on the test-dev split. We emphasize that when submitting to the full test set (image_info_test2017), results must be generated on all images without differentiating between the splits. Finally, we note that the 2017 dev / challenge splits contain the same images as the 2015 dev / challenge splits, so results across years are directly comparable.</p>
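<p>As a sketch of this requirement (field names follow the standard COCO image-info and results formats; the tiny in-memory data below is hypothetical, standing in for a loaded image_info_test2017 file and a results file), one can check that a submission covers every image in the full test set:</p>

```python
# Hypothetical minimal stand-ins for the standard COCO formats:
# image info is a dict with an "images" list; results are a flat
# list of per-detection dicts keyed by "image_id".
test_info = {"images": [{"id": 1}, {"id": 2}, {"id": 3}]}
results = [
    {"image_id": 1, "category_id": 18, "bbox": [10, 20, 50, 40], "score": 0.90},
    {"image_id": 2, "category_id": 1,  "bbox": [0, 0, 30, 60],   "score": 0.75},
]

# Results must be generated for ALL test images, without
# differentiating between the dev and challenge splits.
test_ids = {img["id"] for img in test_info["images"]}
covered = {det["image_id"] for det in results}
missing = sorted(test_ids - covered)
print("test images with no detections:", missing)  # -> [3]
```

<p>In practice the two dicts above would be read with json.load() from the downloaded image-info file and your results file before submission.</p>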
<p>It is not acceptable to create multiple accounts for a single project to circumvent the submission upload limits. If a group publishes two papers describing unrelated methods, separate user accounts may be created. For challenges, a group may create multiple accounts only if submitting substantially different methods to the challenge (e.g., based on different papers). To debug the upload process, we allow participants to submit unlimited evaluation results on the val set.</p>

<h2>2015 Test Set Splits</h2>
<p>This test set was used for the 2015 and 2016 detection and keypoint challenges. <i>It is no longer used and the evaluation servers are closed</i>. However, for historical reference, you may obtain full information on the 2015 test splits by clicking <a href="#Guidelines2015" data-toggle="collapse" aria-expanded="false" aria-controls="Guidelines2015"><i class="glyphicon glyphicon-chevron-right"></i><i class="glyphicon glyphicon-chevron-down"></i>here</a>.</p>
<div id="Guidelines2015" class="collapse">
  The 2015 COCO test set consists of ~80K test images. To limit overfitting while giving researchers more flexibility to test their systems, we divided the test set into four roughly equally sized splits of ~20K images each: <i>test-dev</i>, <i>test-standard</i>, <i>test-challenge</i>, and <i>test-reserve</i>. Submission to the test set automatically results in submission on each split (the identities of the splits are not publicly revealed). In addition, to allow for debugging and validation experiments, we allow researchers <i>unlimited</i> submissions to test-dev. Each test split serves a distinct role, as detailed below.<br/><br/>
  <div class="json fontMono">
    <table class="datasetSplits">
      <tr><th>split</th><th>#imgs</th><th>submission</th><th>scores reported</th></tr>
      <tr><td>Test-Dev</td><td>~20K</td><td>unlimited</td><td>immediately</td></tr>
      <tr><td>Test-Standard</td><td>~20K</td><td>limited</td><td>immediately</td></tr>
      <tr><td>Test-Challenge</td><td>~20K</td><td>limited</td><td>challenge</td></tr>
      <tr><td>Test-Reserve</td><td>~20K</td><td>limited</td><td>never</td></tr>
    </table>
  </div>
  <p>Test-Dev: We place <i>no limit</i> on the number of submissions allowed to test-dev. In fact, <i>we encourage use of test-dev for performing validation experiments</i>. Use test-dev to debug and finalize your method before submitting to the full test set.</p>
  <p>Test-Standard: The test-standard split is the default test data for the detection competition. <i>When comparing to the state of the art, results should be reported on test-standard</i>.</p>
  <p>Test-Challenge: The test-challenge split is used for COCO challenges. Results will be revealed during the relevant workshop.</p>
  <p>Test-Reserve: The test-reserve split is used to protect against possible overfitting. If there are substantial differences between a method's scores on test-standard and test-reserve, this will raise a red flag and prompt further investigation. Results on test-reserve will not be publicly revealed.</p>
  <p>We emphasize that except for test-dev, results <i>cannot</i> be submitted to a single split and must instead be submitted on the full test set. A submission to the test set populates three leaderboards: test-dev, test-standard and test-challenge (the challenge leaderboard will not be revealed until the relevant workshop). It is not possible to submit to test-standard without submitting to test-challenge or vice-versa (however, it is possible to submit to the test set without making results public, see below). The identity of the images in each split is <i>not</i> revealed, except for test-dev.</p>
  <p>The test-dev 2015 set is a subset of the 2015 Testing set. The specific images belonging to test-dev are listed in the "image_info_test-dev2015.json" file available on the <a href="#download">download page</a> as part of the "2015 Testing Image info" download. As discussed, we place <i>no limit</i> on the number of submissions allowed on test-dev. Note that while submitting to test-dev will produce evaluation results, doing so will not populate the public test-dev leaderboard. Instead, submitting to the full test set populates the test-dev leaderboard. This limits the number of results displayed on the test-dev leaderboard.</p>
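  <p>For historical completeness, the listed file identifies which test images belong to test-dev; a hedged sketch (the mock data below stands in for a loaded "image_info_test-dev2015.json" and a full test-set results file, using the standard COCO field names):</p>

```python
# Hypothetical stand-in for the parsed image_info_test-dev2015.json,
# which lists the IDs of the test images belonging to test-dev.
dev_info = {"images": [{"id": 101}, {"id": 103}]}

# Hypothetical results generated on the full 2015 test set.
full_results = [
    {"image_id": 101, "category_id": 3, "bbox": [5, 5, 20, 20], "score": 0.80},
    {"image_id": 102, "category_id": 3, "bbox": [1, 1, 10, 10], "score": 0.60},
    {"image_id": 103, "category_id": 7, "bbox": [2, 2, 15, 15], "score": 0.50},
]

# Restrict the full result set to the publicly identified test-dev images.
dev_ids = {img["id"] for img in dev_info["images"]}
dev_results = [det for det in full_results if det["image_id"] in dev_ids]
print(len(dev_results))  # -> 2
```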
  <p>Test-dev should be used only for validation and debugging: in a publication <i>it is not acceptable to report results on test-dev only</i>. However, for validation it is acceptable to report results of competing methods on test-dev (obtained from the public leaderboard). While test-dev is prone to some overfitting, we expect this may still be useful in practice. We emphasize that final comparisons should always be performed on test-standard. <i>Note: these rules no longer apply in 2017.</i></p>
  <p>The differences between the validation and test-dev sets are threefold: guaranteed consistent evaluation of test-dev using the evaluation server, test-dev cannot be used for training (annotations are private), and a leaderboard is provided for test-dev, allowing for comparison with the state-of-the-art. We note that the continued popularity of the outdated PASCAL VOC 2007 dataset partially stems from the fact that it allows for simultaneous validation experiments and comparisons to the state-of-the-art. Our goal with test-dev is to provide similar functionality (while keeping annotations private).</p>
</div>

<h2>2014 Test Set Splits</h2>
<p>The 2014 test set is only used for the <a href="#captions-2015">captioning challenge</a>. Please see the <a href="#captions-eval">caption eval</a> page for details.</p>
